1 🔍 Analysis

1.1 Identify and remove the direct identifiers from the data.

Out of 52 variables, below 7 were considered as direct identifiers :

  • IPAddress
  • LocationLatitude
  • LocationLongitude
    • The above 3 variables can help directly in detecting the location of the respondent.
    • In order to not reduce the utility of the data, postcode is not removed and will be helpful in doing some area-wise analysis.
  • ResponseID
  • ResponseLastName
  • ResponseFirstName
  • QID28 (Containing email address of the respondent)
    • The above 4 will easily help/convey the individual identity of a respondent

In addition, class type of below variables was changed to factorial as they contained categorical values:

QID12, QID10, QID14, QID16, QID20, QID23, QID26, QID27, QID17_1, QID17_2, QID17_3, QID17_4, QID17_5, QID18_1, QID18_2, QID18_3, QID18_4, QID18_5, QID24_1, QID24_2, QID24_3, QID24_4, QID24_5, QID24_6, QID25_1, QID25_2, QID25_3, QID25_4, QID25_5, QID25_6

Lastly, a data collection error of 2 variables namely QID10 and QID16 was also fixed.

  • These variables collected option value rather than option number for 1 categorical level.
  • It was resolved by changing the class of variable to char and then rep;lacing all the issue values and changing the class to factorial then.
data %>% 
  select(-c(IPAddress,ResponseID,ResponseLastName,ResponseFirstName,LocationLatitude,LocationLongitude,QID28)) %>% 
  mutate(Finished = as.factor(Finished),
         QID12 = as.factor(QID12),
         QID10 = as.character(QID10),
         QID14 = as.factor(QID14),
         QID16 = as.character(QID16),
         QID20 = as.factor(QID20),
         QID23 = as.factor(QID23),
         QID26 = as.factor(QID26),
         QID27 = as.factor(QID27),
         QID17_1 = as.factor(QID17_1),
         QID17_2 = as.factor(QID17_2),
         QID17_3 = as.factor(QID17_3),
         QID17_4 = as.factor(QID17_4),
         QID17_5 = as.factor(QID17_5),
         QID18_1 = as.factor(QID18_1),
         QID18_2 = as.factor(QID18_2),
         QID18_3 = as.factor(QID18_3),
         QID18_4 = as.factor(QID18_4),
         QID18_5 = as.factor(QID18_5),
         QID24_1 = as.factor(QID24_1),
         QID24_2 = as.factor(QID24_2),
         QID24_3 = as.factor(QID24_3),
         QID24_4 = as.factor(QID24_4),
         QID24_5 = as.factor(QID24_5),
         QID24_6 = as.factor(QID24_6),
         QID25_1 = as.factor(QID25_1),
         QID25_2 = as.factor(QID25_2),
         QID25_3 = as.factor(QID25_3),
         QID25_4 = as.factor(QID25_4),
         QID25_5 = as.factor(QID25_5),
         QID25_6 = as.factor(QID25_6)) %>% 
  mutate(QID10 = str_replace(QID10,"0","1"),
         QID16 = str_replace(QID16,"0","1")) %>% 
  mutate(QID10 = as.factor(QID10),
         QID16 = as.factor(QID16)) %>% 
  head() %>%
  kable(caption = "Data after deleting the direct identifiers") %>% 
  kable_styling(bootstrap_options = c("basc","striped","hover"))
Table 1.1: Data after deleting the direct identifiers
id_number StartDate Finished Progress Duration EndDate RecordedDate QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27
1 2021-01-30 1 100 1588.3878 2021-01-30 2021-01-30 1 42 3 1 3186 107512.69 122311.5 118999.41 2 2 6 5 1 0 1 0 1 1 0 0 1 0 1 4 0 0 1 1 0 0 0 0 0 1 0 1 3 1
2 2021-01-22 1 100 1117.5621 2021-01-22 2021-01-22 1 68 2 4 3049 75550.19 0.0 100405.64 6 1 6 1 0 1 1 0 0 NA NA NA NA NA 4 NA 0 0 1 0 0 0 1 0 0 0 0 1 3 2
3 2021-01-12 1 100 372.0479 2021-01-12 2021-01-12 1 51 1 0 3201 88755.34 100283.9 97700.49 3 6 4 4 1 0 0 0 0 0 0 0 1 0 4 1 1 1 1 1 0 0 1 0 0 0 1 0 2 3
4 2021-01-10 1 100 100.5913 2021-01-10 2021-01-10 1 53 3 2 3068 271736.36 295452.9 292357.52 2 2 6 4 0 0 0 1 0 0 0 0 1 0 1 2 1 0 1 0 0 1 1 0 1 1 0 0 1 1
5 2021-01-09 1 100 874.8889 2021-01-09 2021-01-09 1 43 3 0 3119 172966.94 189771.7 195586.28 4 2 3 6 0 0 1 0 0 0 0 0 1 0 1 4 0 0 1 0 1 1 1 0 0 0 0 0 2 3
6 2021-01-03 1 100 4367.4938 2021-01-04 2021-01-04 1 70 1 2 3139 115480.48 0.0 130480.92 2 1 6 1 1 1 0 0 0 NA NA NA NA NA 1 NA 0 0 0 0 1 0 1 1 0 0 1 0 2 1

1.2 De-identification strategy

In the above 1.1, we can still see that,there some variables like age, QID21,QID22 can be used to identify a person. In order to make the data more protected and also making sure that the utility of the data s not reduced, we performed the following de-identification technique:

  • On age, aggregation was applied to convert into age_groups of 5 years.
  • On both QID21,QID22, Perturbation was used because:
    • Using this technique, the utility of the data is not reduced as it is in accordance with age groups.
    • Individual data gets censored,which reduces the identification risk.

Moreover, I have also made below updations:

  • Removed the unwanted variables (Will be of no use in analysis)
  • Converted the composite variable like RecordedDate into separate variables,
  • Renamed other variables to some meaningful names.
data1 %>% 
  select(-c(StartDate,EndDate,QID29)) %>% 
  separate(RecordedDate,into = c("Year","Month","Day")) %>% 
  mutate(Year = as.numeric(Year),
         Month = as.numeric(Month),
         Day = as.numeric(Day)) %>% 
  rename(age = QID6,
         adults = QID7,
         children = QID8,
         postcode = QID15,
         exp_house_inc_2021 = QID19,
         house_inc_2020 = QID21,
         house_inc_2019 = QID22,
         work_mood_2019 = QID12,
         work_mood_2020 = QID10,
         avg_work_hour_2019 = QID14,
         avg_work_hour_2020 = QID16,
         TDW_2019 = QID17_1,
         EM_2019 = QID17_2,
         EE_2019 = QID17_3,
         LT_2019 = QID17_4,
         ON_2019 = QID17_5,
         TDW_2020 = QID18_1,
         EM_2020 = QID18_2,
         EE_2020 = QID18_3,
         LT_2020 = QID18_4,
         ON_2020 = QID18_5,
         work_stability_2019 = QID20,
         work_stability_2020 = QID23,
         comfortable_2019 = QID24_1,
         lonely_2019 = QID24_2,
         active_2019 = QID24_3,
         connected_2019 = QID24_4,
         peaceful_2019 = QID24_5,
         chaotic_2019 = QID24_6,
         comfortable_2020 = QID25_1,
         lonely_2020 = QID25_2,
         active_2020 = QID25_3,
         connected_2020 = QID25_4,
         peaceful_2020 = QID25_5,
         chaotic_2020 = QID25_6,
         mental_cond_2019 = QID26,
         mental_cond_2020 = QID27) %>%
  mutate(age_group = cut(age, breaks = 5)) %>%
  group_by(age_group) %>%
  mutate(new_house_inc_2019 = sample(house_inc_2019,n(),replace = FALSE),
         new_house_inc_2020 = sample(house_inc_2020,n(),replace = FALSE)) %>%
  select(-c(age,house_inc_2020,house_inc_2019)) %>% 
   head() %>%
  kable(caption = "Data after applying de-identification technique and renaming variables") %>% 
  kable_styling(bootstrap_options = c("basc","striped","hover"))
Table 1.2: Data after applying de-identification technique and renaming variables
id_number Finished Progress Duration Year Month Day adults children postcode exp_house_inc_2021 work_mood_2019 work_mood_2020 avg_work_hour_2019 avg_work_hour_2020 TDW_2019 EM_2019 EE_2019 LT_2019 ON_2019 TDW_2020 EM_2020 EE_2020 LT_2020 ON_2020 work_stability_2019 work_stability_2020 comfortable_2019 lonely_2019 active_2019 connected_2019 peaceful_2019 chaotic_2019 comfortable_2020 lonely_2020 active_2020 connected_2020 peaceful_2020 chaotic_2020 mental_cond_2019 mental_cond_2020 age_group new_house_inc_2019 new_house_inc_2020
1 1 100 1588.3878 2021 1 30 3 1 3186 118999.41 2 2 6 5 1 0 1 0 1 1 0 0 1 0 1 4 0 0 1 1 0 0 0 0 0 1 0 1 3 1 (35.2,48.4] 90858.26 0.00
2 1 100 1117.5621 2021 1 22 2 4 3049 100405.64 6 1 6 1 0 1 1 0 0 NA NA NA NA NA 4 NA 0 0 1 0 0 0 1 0 0 0 0 1 3 2 (61.6,74.8] 186000.99 93551.37
3 1 100 372.0479 2021 1 12 1 0 3201 97700.49 3 6 4 4 1 0 0 0 0 0 0 0 1 0 4 1 1 1 1 1 0 0 1 0 0 0 1 0 2 3 (48.4,61.6] 148676.79 142314.86
4 1 100 100.5913 2021 1 10 3 2 3068 292357.52 2 2 6 4 0 0 0 1 0 0 0 0 1 0 1 2 1 0 1 0 0 1 1 0 1 1 0 0 1 1 (48.4,61.6] 174018.00 82228.10
5 1 100 874.8889 2021 1 9 3 0 3119 195586.28 4 2 3 6 0 0 1 0 0 0 0 0 1 0 1 4 0 0 1 0 1 1 1 0 0 0 0 0 2 3 (35.2,48.4] 167641.89 90313.35
6 1 100 4367.4938 2021 1 4 1 2 3139 130480.92 2 1 6 1 1 1 0 0 0 NA NA NA NA NA 1 NA 0 0 0 0 1 0 1 1 0 0 1 0 2 1 (61.6,74.8] 152334.23 0.00

1.3 Check strategy

1.3.1 Can the new generated house_inc_2019 be used to identify a person?

The below 1.1 showcases that:

  • There are some extreme values which might help in identifying a person.
  • One solution to this is remove these values or change them using the top-bottom technique.

To have a better insight look at table 1.3

plt1 <- data2 %>%
  ggplot(aes(y = new_house_inc_2019)) +
  geom_boxplot() +
  ylab("Household Income 2019")
ggplotly(plt1)

Figure 1.1: Value Distribution of household income in 2019

data2 %>%
  select(new_house_inc_2019) %>% 
  slice_max(new_house_inc_2019,n=2) %>% 
  kable(caption = "Tabular Distribution of top 2 household income in 2019 w.r.t age group") %>% 
  kable_styling(bootstrap_options = c("basc","striped","hover"))
Table 1.3: Tabular Distribution of top 2 household income in 2019 w.r.t age group
age_group new_house_inc_2019
(21.9,35.2] 216350.2
(21.9,35.2] 214772.6
(35.2,48.4] 292182.7
(35.2,48.4] 285240.3
(48.4,61.6] 311704.4
(48.4,61.6] 271736.4
(61.6,74.8] 284878.0
(61.6,74.8] 255025.4
(74.8,88.1] 158431.9
(74.8,88.1] 126481.9

Therefore table 1.3 clearly states that removal/updation of the extreme values is not required because:

  • Due to the addition of derived variable age_group, it is impossible to recognize the exact age of a person.
  • In addition the perturbation technique used above makes the protects by assigning the extreme values to a different person in a same age_group.
  • Moreover the age_group in which the max. value seen is (48.4,61.6], which itself has the max. occurrence of values in this data, making it again a cumbersome task to figure the original identity. (see table 1.4)
data2 %>%
  select(age_group) %>% 
  summary() %>% 
  kable(caption = "Tabular Distribution of top 2 household income in 2019 w.r.t age group") %>% 
  kable_styling(bootstrap_options = c("basc","striped","hover"))
Table 1.4: Tabular Distribution of top 2 household income in 2019 w.r.t age group
age_group
(21.9,35.2]: 66
(35.2,48.4]:359
(48.4,61.6]:441
(61.6,74.8]:126
(74.8,88.1]: 7
NA’s : 1

The above table 1.4 also confirms that the aggregation technique has successfully made the age variable as a low risk gateway to know the identity of a person.

1.3.2 Can the new generated house_inc_2020 be used to identify a person?

No, the newly generated variable house_inc_2020 cannot be used to identify a person and the same can be inferred clearly from figure 1.2 and table 1.5.

plt2 <- data2 %>%
  ggplot(aes(y = new_house_inc_2020)) +
  geom_boxplot() +
  ylab("Household Income 2020")
ggplotly(plt2)

Figure 1.2: Value Distribution of household income in 2020

data2 %>%
  select(new_house_inc_2020) %>% 
  slice_max(new_house_inc_2020,n=2) %>% 
  kable(caption = "Tabular Distribution of top 2 household income in 2019 w.r.t age group") %>% 
  kable_styling(bootstrap_options = c("basc","striped","hover"))
Table 1.5: Tabular Distribution of top 2 household income in 2019 w.r.t age group
age_group new_house_inc_2020
(21.9,35.2] 248790.4
(21.9,35.2] 236554.6
(35.2,48.4] 310998.3
(35.2,48.4] 297613.1
(48.4,61.6] 298437.0
(48.4,61.6] 295452.9
(61.6,74.8] 274251.0
(61.6,74.8] 266758.2
(74.8,88.1] 183463.5
(74.8,88.1] 133377.7

1.4 Computer readable structure

For a data to be finally interpreted by a machine, a well structured and easily readable computer format is required.The most common machine-readable format is “Comma Separated Variables” (CSV), which is provided directly by many standard database and spreadsheet products. CSV stores tabular data in a text-based format, making it easily exchanged by machines.

1.5 Save data in a csv form in the data folder

write_csv(data2,here::here("data/release-data-Garg-Karan.csv"))

```

1.6 Resources

A. (2021) Xie (2021a) Wickham et al. (2021) Wickham et al. (2020) Müller (2020) Zhu (2021) Xie (2021b) Sievert et al. (2021) Wickham (2021)

A., Kennedy L. 2021. Simulated Data Survey for Assignment 3 Etc5512.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2021. Plotly: Create Interactive Web Graphics via Plotly.js. https://CRAN.R-project.org/package=plotly.
Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2020. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Xie, Yihui. 2021a. Bookdown: Authoring Books and Technical Documents with r Markdown. https://CRAN.R-project.org/package=bookdown.
———. 2021b. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.